native build directories hardlinking file fetcher de-duplicates in-flight download requests #206
Conversation
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
select {
case <-d.wait:
	if d.err != nil {
		return d.err
```
This is incorrect. A caller of GetFile() may receive an error that was produced by another invocation. This should not happen, as those errors may also include things like context cancellation, etc.
Updated to retry if it failed because of context cancellation.
nit: bb_storage/pkg/auth/remoteAuthorizer.authorizeSingle also retries, but has the same kind of flaw. Consider the case where the timeout is 55s, downloading the file takes 50s, one call comes in every 30s, and the first call is cancelled after 45s. Then all retries for the later requests will also fail, because when each retry starts, its context has less than 50s left:

```
0s   RPC 0
30s  RPC 1
45s  RPC 0 cancelled, RPC 1 retries
60s  RPC 2
85s  RPC 1 times out (5s left to download), RPC 2 retries
90s  RPC 3
115s RPC 2 times out (15s left to download), RPC 3 retries
120s RPC 4
145s RPC 3 times out (15s left to download), RPC 4 retries
...
```

One solution is to manually cancel a context.Background() when there are no requests left waiting. I don't know how complex the implementation and tests for this kind of solution would be, or if this is just not worth bothering with.
Can't we just eliminate the error propagation logic entirely? Just change that downloads map to:

```go
downloads map[string]<-chan struct{}
```

Only use that to wait for an existing download to complete.
Thanks for the suggestion. Updated to use the new map.
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
type download struct {
	wait chan struct{}
	err  error
```
Suggested change:

```go
// err is available only when wait has been closed.
err error
```
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
}

type download struct {
	wait chan struct{}
```
nit: It is possible to be more strict here with wait <-chan struct{}, and further down use:

```go
wait := make(chan struct{})
d = &download{wait: wait}
close(wait)
```
Thanks for the suggestion.
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
d = &download{wait: make(chan struct{})}
ff.downloads[key] = d
ff.downloadsLock.Unlock()
```
ff.downloadsLock does not necessarily need to be held when setting d.err, as it is only read after d.wait has been closed. This means that defer can be used.

Suggested change:

```go
wait := make(chan struct{})
d = &download{wait: wait}
ff.downloads[key] = d
ff.downloadsLock.Unlock()
defer func() {
	ff.downloadsLock.Lock()
	delete(ff.downloads, key)
	ff.downloadsLock.Unlock()
	close(wait)
}()
```
Thanks for the suggestion. Updated to use defer. The code is cleaner.
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
wasMissing, err = ff.tryLinkFromCache(key, directory, name)
if err == nil {
	// File appeared in cache, no download needed.
	ff.downloadsLock.Lock()
	delete(ff.downloads, key)
	ff.downloadsLock.Unlock()
	close(d.wait)
	return nil
} else if !wasMissing {
	// tryLinkFromCache had a real error (not just missing).
	ff.downloadsLock.Lock()
	d.err = err
	delete(ff.downloads, key)
	ff.downloadsLock.Unlock()
	close(d.wait)
	return err
}
```
See comment on line 125.
Suggested change:

```go
wasMissing, d.err = ff.tryLinkFromCache(key, directory, name)
if !wasMissing {
	// Either the file appeared in cache and no download is needed,
	// or tryLinkFromCache had a real error (not just missing).
	return d.err
}
```
HongboDu-at
left a comment
Thank you for the review.
EdSchouten
left a comment
Thanks for working on this!
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
if err == nil {
	return nil
} else if wasMissing {
	return ff.GetFile(ctx, blobDigest, directory, name, isExecutable)
```
We're calling GetFile() recursively here. Though I get why you're doing this, I think it hurts readability. Can't we add a for { ... } to the top level of GetFile() instead?
Thanks for the suggestion. Changed to add a for { ... } to the top level of GetFile() instead.
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
_, ok := ff.filesSize[key]
ff.filesLock.RUnlock()
if !ok {
	return true, os.ErrNotExist
```
Is this wasMissing return value really needed? Wouldn't it be sufficient to just call os.IsNotExist(err) at the call site?
Thanks for the suggestion. Changed to just call os.IsNotExist(err).
EdSchouten
left a comment
I think this change is starting to shape up nicely. I have one remaining comment. Be sure to address this, and I'll try to merge this quickly. Thanks!
pkg/cas/hardlinking_file_fetcher.go
Outdated
```go
if downloadErr == nil {
	// The file was downloaded successfully. Place it into the
	// cache, so that successive calls may use it.
	ff.filesLock.Lock()
	if _, ok := ff.filesSize[key]; !ok {
		ff.evictionLock.Lock()

		// Remove old files from the cache if necessary.
		sizeBytes := blobDigest.GetSizeBytes()
		if err := ff.makeSpace(sizeBytes); err != nil {
			downloadErr = err
		} else {
			// Hardlink the file into the cache.
			if err := directory.Link(name, ff.cacheDirectory, path.MustNewComponent(key)); err != nil && !os.IsExist(err) {
				downloadErr = util.StatusWrapfWithCode(err, codes.Internal, "Failed to add cached file %#v", key)
			} else {
				ff.evictionSet.Insert(key)
				ff.filesSize[key] = sizeBytes
				ff.filesTotalSize += sizeBytes
			}
		}
		ff.evictionLock.Unlock()
	} else {
		// The file was already part of our bookkeeping,
		// but was missing on disk. Repair this by adding
		// a link to the newly downloaded file.
		if err := directory.Link(name, ff.cacheDirectory, path.MustNewComponent(key)); err != nil && !os.IsExist(err) {
			downloadErr = util.StatusWrapfWithCode(err, codes.Internal, "Failed to repair cached file %#v", key)
		}
	}
	ff.filesLock.Unlock()
}

	// Hardlink the file into the cache.
	if err := directory.Link(name, ff.cacheDirectory, path.MustNewComponent(key)); err != nil && !os.IsExist(err) {
		return util.StatusWrapfWithCode(err, codes.Internal, "Failed to add cached file %#v", key)
	}
	ff.evictionSet.Insert(key)
	ff.filesSize[key] = sizeBytes
	ff.filesTotalSize += sizeBytes
} else if wasMissing {
	// Even though the file is part of our bookkeeping, we
	// observed it didn't exist. Repair this inconsistency.
	if err := directory.Link(name, ff.cacheDirectory, path.MustNewComponent(key)); err != nil && !os.IsExist(err) {
		return util.StatusWrapfWithCode(err, codes.Internal, "Failed to repair cached file %#v", key)
	}
return downloadErr
```
I'm just trying to figure out why this change is still needed. It probably isn't...? If so, can we revert this part?
Thanks for the suggestion. Reverted this cache insertion part.

From internal usage, this change reduces the time to fetch inputs (~4 GB) from ~60s to ~20s.